METHOD FOR ADAPTING QUANTIZATION IN VIDEO CODING USING
FACE DETECTION AND VISUAL ECCENTRICITY WEIGHTING

The present application is a continuation of co-pending
patent application, Daly et al., Serial No. 60/071,099,
filed January 9, 1998.

BACKGROUND OF THE INVENTION

The present invention relates to a system for
encoding facial regions of a video that incorporates a model
of the human visual system to encode frames in a manner to
provide a substantially uniform apparent quality.

In many systems the number of bits available for
encoding a video, consisting of a plurality of frames, is
fixed by the bandwidth available in the system. Typically,
encoding systems use an ad hoc control technique to select
quantization parameters that will produce a target number of
bits for the video while simultaneously attempting to encode
the video frames with the highest possible quality. For
example, in digital video recording, a group of frames must
occupy the same number of bits for an efficient
fast-forward/fast-rewind capability. In video telephones, the
channel rate, communication delay, and the size of the
encoder buffer determine the number of available bits for a
frame.

There are numerous systems that address the
problem of how to encode video to achieve high quality while
controlling the number of bits used. The systems are
usually known as rate, quantizer, or buffer control
techniques and can be generally classified into three major
classes.

The first class are systems that encode each block
of the image several times with a set of different
quantization factors, measure the number of bits produced
for each quantization factor, and then attempt to select a
quantization factor for each block so that the total number
of bits for all the blocks equals a target number. While
generally accurate, such a technique is not suitable for
real-time encoding systems because of its high computational
complexity.

The second class are systems that measure the
number of bits used in previously encoded image blocks,
buffer fullness, and block activity, and use all these
measures to select a quantization factor for each block of
the image.
Such techniques are popular for real-time encoding systems 
because of their low computational complexity. 
Unfortunately, such techniques are quite inaccurate and must 
be combined with additional techniques to avoid bit or 
buffer overflows and underflows. 

The third class are systems that use a model to 
predict the number of bits necessary for encoding each of 
the image blocks in terms of the block's quantization factor 
and other simple parameters, such as block variances. These 
models are generally based on mathematical approximations or 
predefined tables. Such systems are computationally simple 
and are suitable for real-time systems, but unfortunately 
they are highly sensitive to inaccuracies in the model
itself.

Some rate control systems incorporate face
detection. One such system, along with other systems
that use face detection, is described below.

Zhou, U.S. Patent No. 5,550,581, discloses a low 
bit rate audio and video communication system that 
dynamically allocates bits among the audio and video 
information based upon the perceptual significance of 
the audio and video information. For a video 
teleconferencing system Zhou suggests that the perceptual 
quality can be improved by allocating more of the video bits 
to encode the facial region of the person than the remainder 
of the scene. In addition, Zhou suggests that the mouth
area, including the lips, jaw, and cheeks, should be
allocated more video bits than the remainder of the face
because of the motion of these portions. In order to encode
the face and mouth areas more accurately Zhou uses a
subroutine that incorporates manual initialization of the
position of each speaker within a video screen.
Unfortunately, the manual identification of the facial
region is unacceptable for automated systems.

Kosemura et al., U.S. Patent No. 5,187,574,
disclose a system for automatically adjusting the field of
view of a television door phone in order to keep the head of
a person centered in the image frame. The detection system
relies on detecting the top of the person's head by
comparing corresponding pixels in successive images. The
number of pixels is counted along a horizontal line to
determine the location of the head. However, such a head
detection technique is not robust.

Sexton, U.S. Patent No. 5,086,480, discloses a
video image processing system in which an encoder identifies
the head of a person from a head-against-a-background scene.
The system uses training sequences and fits a minimum
rectangle to the candidate pixels. The underlying
identification technique uses vector quantization.
Unfortunately, the training sequences require the use of an
anticipated image which will be matched to the actual image.
If the actual image in the scene does not sufficiently
match any of the training sequences then the head will not
be detected.

Lambert, U.S. Patent No. 5,012,522, discloses a
system for locating and identifying human faces in video
scenes. A face finder module searches for facial
characteristics, referred to as signatures, using a
template. In particular, the signatures searched for are
the eye and nose/mouth. Unfortunately, such a template-
based technique is not robust to occlusions, profile
changes, and variations in the facial characteristics.

Ueno et al., U.S. Patent No. 4,951,140, disclose
a facial region detecting circuit that detects a face based
on the difference between two frames of a video using a
histogram-based technique. The system allocates more bits
to the facial region than the remaining region. However,
such a histogram-based technique may not necessarily detect
the face in the presence of significant motion.

Moghaddam et al., in a paper entitled "An
Automatic System for Model-Based Coding of Faces," IEEE Data
Compression Conference, March 1995, disclose a system for
two-dimensional image encoding of human faces. The system
uses eigen-templates for template matching, which is
computationally intensive.

Eleftheriadis et al., in a paper entitled
"Automatic Face Location Detection and Tracking for Model-
Assisted Coding of Video Teleconferencing Sequences at Low
Bit-Rates," Signal Processing: Image Communication 7 (1995),
disclose a model-assisted coding technique which exploits
the face location information of video sequences to
selectively encode regions of the video to produce coded
sequences in which the facial regions are clearer and
sharper. In particular, the system initially differences
two frames of a video to detect motion. Then the system
attempts to locate the top of the head of a person by
searching for a sequential series of non-zero horizontal
pixels in the difference image, as shown in FIG. 11 of
Eleftheriadis et al. A set of ellipses with various sizes
and aspect ratios having their uppermost portion fixed at
the potential location of the top of the head are fitted to
the image data. Unfortunately, scanning the difference
image for potential sequences of non-zero pixels is complex
and time consuming. In addition, the system taught by
Eleftheriadis et al. includes many design parameters that
need to be selected for each particular system and video
sequence, making it difficult to adapt the system for
different types of video sequences and systems.

Glenn, in a chapter entitled "Real-Time Display 
Systems, Present and Future," from the book Visual Science 
Engineering, edited by O.H. Kelly, 1994, teaches a display 
system that varies the resolution of the image from the 
center to the edge, in the hope that the decrease in 
resolution would lead to a bandwidth reduction. The 
resolution decrease is accomplished by discarding pixel 
information to blur the image. The presumption in Glenn is 
that the observer is looking at the center of the display. 
The attempt was unsuccessful because although it was found 
that the observer's eyes tended to stay in the center one- 
quarter of the total image area, the resolution at the edges 
of the image could not be sufficiently reduced before the 
resulting blur was detectable. 

Browder et al., in a paper entitled "Eye-Slaved
Area-Of-Interest Display Systems: Demonstrated Feasible In
The Laboratory," process video sequences using gaze-
contingent techniques. The gaze-contingent processing is
implemented by adaptively varying image quality within each 
video field, such that image quality is maximal in the 
region most likely to be viewed while being reduced in the 
periphery. This image quality reduction is accomplished by 
blurring the image or by introducing quantization artifacts. 
The system includes an eye tracker with a computer graphic 
flight simulator. Two image sequences are created. One 
sequence has a narrow field of view (19 or 25 degrees) with 
high resolution and the other sequence has a wide field of 
view (76 or 140 degrees) with low resolution. The two image 
sequences are combined optically with the high resolution 
sequence enslaved to the visual system's instantaneous 
center of gaze. To keep the boundary between the two
regions from being distracting, an arbitrary linear roll-off
(blending) from the high resolution inset image to the
low resolution image is used. The use of an eye tracker in
the system is unsuitable for inexpensive video telephones
where such an eye tracker is not provided. In addition, the
linear roll-off does not match the eye's sensitivity
variation, resulting in either variable image quality or
unnecessary regions of high resolution.

Stelmach et al., in a paper entitled "Processing
Image Sequences Based On Eye Movements," disclose a video
encoding system that employs the concept of varying the 
visual sensitivity as a function of expected eye position. 
The expected eye position is generated by measuring a set of 
observers' eye movements to specific video sequences. Then 
the averaged eye movements are calculated for the set of 
observers. However, such a system requires measurements of 
the eye position which may not be available for inexpensive 
teleconferencing systems. In addition, it is difficult, if 
not impossible, to extend the system to an unknown image 
sequence thus requiring observer measurements for any image 
sequence the system is going to encode. Moreover, variation 
of the resolution is not an efficient technique for 
bandwidth reduction. 

What is desired, therefore, is a video encoding 
system that automatically locates facial regions within the 
video and encodes the video in a manner that provides a 
uniform quality of the video to a viewer. 

SUMMARY OF THE PRESENT INVENTION 

The present invention overcomes the aforementioned 
drawbacks of the prior art by providing a system for 
encoding video that detects the location of a facial region 
of a frame of the video. Sensitivity information is 
calculated for each of a plurality of locations within the 
video based upon the location of the facial region. The 
frame is encoded in a manner that provides a substantially
uniform apparent quality of the plurality of locations to
the viewer when the viewer is observing the facial region of
the video.

In one embodiment, the detection of the facial 
region includes receiving a first frame and a subsequent 
frame of the video, each of which includes a plurality of
pixels. A difference image is calculated representative of 
the difference between a plurality of the pixels of the 
first frame and a plurality of the pixels of the subsequent 
frame. A plurality of candidate facial regions are 
determined within the difference image, preferably based on 
a transform of the difference image in a spatial domain to a
parameter space. The plurality of candidate facial regions 
are fitted to the difference image to select one of the 
candidate facial regions. 

In another embodiment, the detection of the facial 
region includes fitting the candidate facial regions to the 
difference image to select one of the candidate facial 
regions based on a combination of at least two of the
following three factors: a fit factor
representative of the fit of the candidate facial regions to 
the difference image, a location factor representative of 
the location of the candidate facial regions within the 
video, and a size factor representative of the size of the 
candidate facial regions. 

In yet another embodiment, the sensitivity 
information is calculated for each of the plurality of 
locations within the video based upon both the location of 
the facial region within the video in relation to the 
plurality of locations and a non-linear model of the
sensitivity of a human visual system. 

In a further embodiment, a target bit value equal 
to a total number of bits available for encoding the frame 
is identified. The sensitivity information is calculated 
for each one of the blocks based upon the sensitivity of a 
human visual system observing a particular region of the
image. Quantization values for each of the multiple blocks
are calculated to provide substantially uniform apparent 
quality of each of the blocks in the frame subject to a 
constraint that the total number of bits available for 
encoding the frame is equal to the target bit value. The 
blocks are encoded with the quantization values. 

The foregoing and other objectives, features, and 
advantages of the invention will be more readily understood 
upon consideration of the following detailed description of 
the invention, taken in conjunction with the accompanying 
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of an exemplary 
embodiment of a face detection module of the present 
invention. 

FIG. 2 is an example of location weightings for 
centers of the face detection module of FIG. 1. 

FIG. 3 is an example of radii limits of the face 
detection module of FIG. 1. 

FIG. 4 is an example of centers of considered 
ellipses of the face detection module of FIG. 1. 

FIG. 5 is a block diagram of an exemplary 
embodiment of a visual model of the present invention. 

FIG. 6 illustrates the relationship between a
distance on the display of a viewer's focus and the
resulting visual angle of the viewer.

FIG. 7 illustrates an eccentricity in visual angle 
for each location as a function of the distance from the 
detected region boundary. 

FIG. 8 illustrates an eccentricity versus location 
for a series of viewing distances. 

FIG. 9 illustrates a set of visual sensitivity
data sets for absolute sensitivity of the human visual
system.

FIG. 10 illustrates the visual sensitivity as a
function of pixel location.

FIG. 11 illustrates the resulting cross section of
sensitivity values for an elliptical object.

FIG. 12 is an exemplary embodiment of a block
diagram of a block-based image encoding system of the
present invention.

FIG. 13 illustrates a set of quantization steps
versus block number for one row of blocks in a frame.

FIG. 14 is an exemplary block diagram of an
encoder including the face detection module of FIG. 1, the
visual model of FIG. 5, and the block-based image encoding
system of FIG. 12, of the present invention.

FIG. 15 is an exemplary block diagram of a decoder
of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

In very low bit rate video teleconferencing
systems, state-of-the-art coding techniques produce
artifacts which are systematically present throughout coded
images. The number of artifacts increases with both
increased motion between frames and increased image texture.
In addition, the artifacts usually affect all areas of the
image without discrimination. However, viewers will mostly
notice those coding artifacts in areas of particular
interest to them. In particular, a viewer of a video
teleconferencing system or video telephone will typically
focus his or her attention on the face(s) of the person(s)
on the screen, rather than on areas such as clothing or
background. While fast motion may mask many coding
artifacts, the human visual system has the ability to lock
on and track particular moving objects, such as a person's
face. Accordingly, communication between viewers of very
low bit-rate video teleconferencing systems or video
telephones will be intelligible and pleasing only when the
person's face and facial features are not plagued with an
excessive amount of coding artifacts.

Referring to FIG. 1, a face detection module 10
receives frame i 12 and frame i+n 14, each consisting of a
plurality of pixels. Frames i and i+n may be immediately
successive frames or frames spaced apart by n frames.
Frame i 12 is resized by a scale image block 16 to reduce
its number of pixels. The pixel reduction reduces the
computational requirements of the system by narrowing the
search space. Frame i+n 14 is likewise resized by a scale
image block 18 in the same manner as the scale image block
16.

A scale factor used by the scale image blocks 16
and 18 is variable so that the resulting number of pixels
may be selected to provide sufficient image detail. When
initially detecting a face within a sequence of images the
scale factor is preferably twice (or any other suitable
value) the scale factor used in subsequent calculations.
Thus, when initially detecting a face the resulting image
will include substantially more pixels than during
subsequent tracking. This ensures a good initial
determination of the face location and reduces the
computational requirements of subsequent tracking
calculations since the initial location is used as a
starting location to narrow the search space.

In many applications where a person's head is
moving on a constant background, such as a video telephone,
the movement of the head may be detected by subtracting two
frames of the video from one another. The resulting non-
zero values from the subtracted frames may be representative
of motion, such as the head of a person. A difference image
block 20 subtracts the scaled images from the scale image
blocks 16 and 18 from one another to obtain a difference
image 21. More particularly, the difference image 21 is
obtained by:

    d_{i+n}(k) = I_{i+n}(k) - I_i(k)

where k is the spatial location of the pixel in the resized
image frame, I is the luminance, and the subscripts indicate
the temporal location of the image frame.

A clean difference image block 22 attempts to
remove undesirable non-zero pixels from the difference image
21 and add desirable non-zero pixels to the difference image
21. Initially, the clean difference image block 22 performs
a thresholding operation on the difference image 21:

    d^{th}_{i+n}(k) = 0;  |d_{i+n}(k)| < T
                      1;  |d_{i+n}(k)| \geq T

where T is a predefined threshold value and |.| denotes
absolute value. The resulting thresholded image is
represented by 1's and 0's. Thereafter, morphological
operations are performed on the thresholded image which, for
example, count the number of non-zero pixels in a plurality
of regions of the image. If the non-zero pixel count within
a region is sufficiently small then all the pixels within
that region are set to zero. This removes scattered noise
within the thresholded image. Likewise, if the non-zero
pixel count within a region is sufficiently large then all
the pixels within that region are set to one. This enhances
those regions of the image indicative of motion. The
overall effect of the morphological operations is that
scattered ungrouped non-zero pixels are set to zero and
holes (indicated by zeros) in the grouped non-zero regions
are set to one. The output from the clean difference image
block 22 is a cleaned image 23.
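
By way of illustration only, the differencing, thresholding,
and cleaning steps described above might be sketched in
Python with NumPy as follows. The function name
clean_difference_image, the use of non-overlapping square
regions of side `region`, and the count limits `low` and
`high` are assumptions made for this sketch; the description
does not fix the region shape or the counts that qualify as
"sufficiently small" or "sufficiently large."

```python
import numpy as np

def clean_difference_image(frame_i, frame_in, T=20, region=4, low=2, high=12):
    """Threshold a luminance difference image and clean it by region counts.

    frame_i, frame_in: 2-D luminance arrays (already rescaled) of equal shape.
    T, region, low, high: illustrative values, not taken from the description.
    """
    # Difference image: luminance of frame i+n minus luminance of frame i,
    # computed in a signed type so that negative differences survive.
    d = frame_in.astype(np.int32) - frame_i.astype(np.int32)
    # Thresholded image of 1's and 0's (equality is treated as motion here).
    d_th = (np.abs(d) >= T).astype(np.uint8)

    cleaned = d_th.copy()
    h, w = d_th.shape
    for y in range(0, h, region):
        for x in range(0, w, region):
            count = int(d_th[y:y + region, x:x + region].sum())
            if count <= low:
                # Few non-zero pixels: treat as scattered noise.
                cleaned[y:y + region, x:x + region] = 0
            elif count >= high:
                # Mostly non-zero: fill holes in the moving region.
                cleaned[y:y + region, x:x + region] = 1
    return cleaned
```

Counting over fixed tiles keeps the sketch short; overlapping
windows or conventional morphological opening and closing
would be alternative ways to obtain a similar effect.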

Next, the facial regions are identified within the
cleaned image. A face generally has an elliptical shape so
the face detection module 10 adopts an ellipse as a model to
represent a face within the image. Although the upper
(hair) and lower (chin) areas in actual face outlines may
have quite different curvatures, ellipses provide a good
trade-off between model accuracy and parametric simplicity.
Moreover, due to the fact that the elliptical information is
not actually used to regenerate the face outline, a small
lack of model-fitting accuracy does not have any significant
impact on the overall performance of the coding process.

An ellipse of arbitrary size and "tilt" can be
represented by the following quadratic, non-parametric
equation (implicit form):

    a x^2 + 2b xy + c y^2 + 2d x + 2e y + f = 0

    b^2 - ac < 0

To reduce the computational requirements, the 
system initially identifies the top portion of a person's 
head with a model of a circle with an arbitrary radius and 
center. The top of the head has a predictable circular 
shape for most people so it provides a consistent indicator 
for a person. Decision block 25 branches to a select 
candidate circles block 24 if the initial determination of a 
face location for a sequence of video is necessary, such as 
a new video or scene of a video. The select candidate 
circles block 24 identifies candidate circles 27 as the top 
m peaks, where m is a preset parameter, in an accumulator 
array of a Hough transform of the image. A suitable Hough 
transform for circles is: 

    A(x_c, y_c, r) = A(x_c, y_c, r) + 1
        \forall x_c, y_c, r \in \{ (x - x_c)^2 + (y - y_c)^2 = r^2 \}

where A(.,.,.) is the accumulator array and (x,y) are pixels
in the cleaned difference image 23 which exceed the 
threshold. The Hough transform maps the image into 
a parameter space to identify shapes that can be 
parameterized. By mapping the cleaned difference image to a 
parameter space, the actual shapes corresponding to the 
transform can be identified, in contrast to merely looking 
for a series of pixels in the image space which does not 
accurately detect suitable curvatures of the face, as taught 
by Eleftheriadis et al.
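
As a hedged illustration of the accumulator update, the
following Python sketch votes, for every non-zero pixel of
the cleaned image, for each center that would place the
pixel on a circle of a given radius, and then returns the
top m accumulator entries. The function name hough_circles,
the sampling of each voting circle at `n_angles` angles, and
the explicit list of radii are assumptions of this sketch
rather than details taken from the description.

```python
import numpy as np

def hough_circles(cleaned, radii, m=3, n_angles=90):
    """Return the top-m (votes, x_c, y_c, r) entries of a circle accumulator."""
    h, w = cleaned.shape
    acc = np.zeros((len(radii), h, w), dtype=np.int32)
    ys, xs = np.nonzero(cleaned)                  # pixels (x, y) that vote
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    for ri, r in enumerate(radii):
        for t in thetas:
            # Centers (x_c, y_c) at distance r from each voting pixel.
            xc = np.rint(xs - r * np.cos(t)).astype(int)
            yc = np.rint(ys - r * np.sin(t)).astype(int)
            ok = (xc >= 0) & (xc < w) & (yc >= 0) & (yc < h)
            # A(x_c, y_c, r) += 1; rounding may let one pixel vote more
            # than once for the same cell, which this sketch tolerates.
            np.add.at(acc[ri], (yc[ok], xc[ok]), 1)
    order = np.argsort(acc, axis=None)[::-1][:m]  # largest counts first
    peaks = []
    for idx in order:
        ri, yc, xc = np.unravel_index(idx, acc.shape)
        peaks.append((int(acc[ri, yc, xc]), int(xc), int(yc), radii[ri]))
    return peaks
```

The sketch performs no non-maximum suppression, so several
of the returned entries may be neighboring cells of the same
peak; a practical implementation would likely merge them.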

A score candidate circles block 26 scores each of
the candidate circles 27, in part, based on the fit of the
cleaned difference image 23 to the respective candidate
circle 27. The fit criterion used is as follows. If C is a
candidate circle 27, then let M_C be a mask such that,

    M_C(k) = 1;  k inside or on C
             0;  otherwise.

A pixel k is on the circle contour, denoted C_i, if the pixel
is inside or on the circle, and at least one of the pixels
in its (2L+1) x (2L+1) neighborhood is not. A pixel k is on
the circle border, denoted by C_e, if the pixel is outside
the circle, and at least one of the pixels in its (2L+1) x
(2L+1) neighborhood is either inside or on the circle. The
normalized average intensities I_i and I_e are defined:

    I_i = (1/|C_i|) \sum_{k \in C_i} d^{th}_{i+n}(k)

and

    I_e = (1/|C_e|) \sum_{k \in C_e} d^{th}_{i+n}(k)

where |.| denotes cardinality. The measure of fit is then
defined as:

    R = (1 + I_i) / (1 + I_e)

A large value of R indicates a good fit of the data to the
candidate circle 27. In contrast, a small value of R
indicates a poor fit of the data to the candidate circle 27.
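
The fit measure can be illustrated with the following Python
sketch, offered under stated assumptions rather than as the
implementation of the score candidate circles block 26. The
function name circle_fit, the default L = 1, and the choice
to pad the mask with its edge values (so that the image
boundary does not itself create contour pixels) are all
assumptions of the sketch.

```python
import numpy as np

def circle_fit(d_th, xc, yc, r, L=1):
    """Fit measure R = (1 + I_i) / (1 + I_e) for a candidate circle."""
    h, w = d_th.shape
    yy, xx = np.mgrid[0:h, 0:w]
    inside = (xx - xc) ** 2 + (yy - yc) ** 2 <= r ** 2   # mask: inside or on C

    # For every pixel, find whether its (2L+1) x (2L+1) neighborhood
    # contains any inside pixel and any outside pixel.
    pad = np.pad(inside, L, mode="edge")
    any_in = np.zeros_like(inside)
    any_out = np.zeros_like(inside)
    for dy in range(2 * L + 1):
        for dx in range(2 * L + 1):
            win = pad[dy:dy + h, dx:dx + w]
            any_in |= win
            any_out |= ~win

    contour = inside & any_out     # C_i: inside, with an outside neighbor
    border = ~inside & any_in      # C_e: outside, with an inside neighbor
    I_i = d_th[contour].mean() if contour.any() else 0.0
    I_e = d_th[border].mean() if border.any() else 0.0
    return (1.0 + I_i) / (1.0 + I_e)
```

Because the thresholded image holds only 1's and 0's, I_i
and I_e lie between 0 and 1, so R ranges from 0.5 to 2 in
this sketch, with 2 reached when the contour is entirely
non-zero and the border entirely zero.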

While the respective value of R provides a
reasonable estimation of the appropriateness of the
respective candidate circle 27, the present inventors came
to the realization that video teleconferencing devices have
implicit characteristics that may be exploited to further
determine the appropriateness of candidate circles. In most
video telephone applications the head is usually centrally
located in the upper third of the image. Moreover, the size
of the face is usually within a range of sizes and thus
candidate circles that are exceedingly small or excessively
large are not suitable. Accordingly, in addition to the fit
data, the score candidate circles block 26 also examines the
size and location of the circle. Referring to FIG. 2, the
outer border region 40 of a display 38 is an unsuitable
location for a center of a candidate circle 27, the central
upper third region 42 of the display 38 is a desirable
location for the center of a candidate circle 27, and the
remaining region 44 of the display 38 is acceptable. For
example, the undesirable outer border region 40 may have a
weighting factor of 0.25, the acceptable region 44 a
weighting factor of 0.5, and the desirable central upper
third region 42 a weighting of 1.0. Referring to FIG. 3,
the radii of candidate circles 27 likewise have a similar
distribution of suitability. A candidate circle 27 with a
radius less than the small radius 50 is undesirable and may
be given a weighting of 0.1. A candidate circle 27 with a
radius between the small radius 50 and an intermediate radius
52 is desirable and may be given a weighting of 0.6. The
remainder of the possible large candidate circle radii 53
are undesirable and may be given a weighting of 0.2. Any
other suitable weighting factors may be used for the radii
and locations.

The three parameters used to determine the
suitability of a candidate circle are the fit of the
candidate circle to the cleaned image data, the location of
the candidate circle's center, and the size of the candidate
circle's radius. Any suitable ratio of the three parameters
may be used, such as (0.5)*fit + (0.25)*center + (0.25)*size.
The candidate circles with the highest score are subsequently
used as potential locations of the face in the image for
matching with candidate ellipses to more accurately model
the face. Using circles for the initial determination
provides a fast computationally efficient technique for
determining candidate face locations.
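
For illustration, the example weightings and the example
ratio given above could be combined as in the following
Python sketch. The function names, the `margin` fraction
that defines the outer border region, and the reading of the
"central upper third" as the middle third horizontally and
the top third vertically are assumptions of the sketch; only
the numeric weights and the 0.5/0.25/0.25 ratio come from
the examples in the description.

```python
def location_weight(xc, yc, width, height, margin=0.1):
    """Example weighting for a candidate circle center (cf. FIG. 2)."""
    mx, my = margin * width, margin * height      # assumed border thickness
    if xc < mx or xc >= width - mx or yc < my or yc >= height - my:
        return 0.25                               # outer border region 40
    if width / 3 <= xc < 2 * width / 3 and yc < height / 3:
        return 1.0                                # central upper third region 42
    return 0.5                                    # remaining region 44

def size_weight(r, small, intermediate):
    """Example weighting for a candidate circle radius (cf. FIG. 3)."""
    if r < small:
        return 0.1
    if r <= intermediate:
        return 0.6
    return 0.2

def circle_score(fit, xc, yc, r, width, height, small, intermediate):
    """Example ratio: (0.5)*fit + (0.25)*center + (0.25)*size."""
    return (0.5 * fit
            + 0.25 * location_weight(xc, yc, width, height)
            + 0.25 * size_weight(r, small, intermediate))
```

For instance, under these assumptions a circle of radius 40
centered at (160, 60) in a 320 x 240 image, with a fit of
2.0 and radius limits of 20 and 80, would score
0.5*2.0 + 0.25*1.0 + 0.25*0.6 = 1.4.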

After the initial candidate circles are determined
and scored, a generate candidate ellipses block 28 generates
a set of potential ellipses 29 to be matched to the cleaned
image 23 for each candidate circle with a sufficient score.
Ellipses with a center in the region around the center of
the suitable candidate circle and a set of radii in the
general range of that of the respective candidate circle are
considered. Referring to FIG. 4, the centers of considered
ellipses include a set of ellipse centers 31 in a range in
the horizontal direction and the vertical direction about
the center 33 of the respective candidate circle. The range
of candidate ellipse centers in the vertical direction, "Y,"
is greater than the range of candidate ellipse centers in
the horizontal direction, "X." The reason for the increased
variability in the vertical direction is that faces tend
to have an elliptical shape in the vertical direction, so
increased variability in the vertical direction of the
center of the candidate ellipse permits a better fit to the
actual face location. In contrast, faces tend not to vary
much in the horizontal direction so less variability is
necessary. In other words, based on the location of the
initial candidate circles there is more confidence in the
centers in the horizontal direction than the vertical
direction. This difference in variability helps reduce the
number of candidate ellipses considered, which reduces the
computational requirements of the system.

Preferably, a set of candidate ellipses for each
circle center are considered with centers within a region 47
around the circle center 33 and having radii somewhat less
than and greater than the radius of the circle.

The candidate ellipses 29 from the generate
candidate ellipses block 28 are then scored by the score
candidate ellipses block 32, which scores the candidate
ellipses 29 using the same fit criteria as the score
candidate circles block 26, except that an elliptical mask
is used instead of a circular one. The score candidate
ellipses block 32 may additionally use the center location
of the ellipse and its radii as additional parameters, if
desired.
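
A minimal sketch of how such a set of candidate ellipses
might be enumerated is given below in Python. The function
name generate_candidate_ellipses and the particular ranges
(`dx`, `dy`, `dr`) and `step` are hypothetical values chosen
only to show a vertical center range that is larger than the
horizontal one; the description does not specify them.

```python
from itertools import product

def generate_candidate_ellipses(xc, yc, r, dx=2, dy=4, dr=2, step=2):
    """Enumerate (center_x, center_y, radius_x, radius_y) around a circle.

    xc, yc, r: integer center and radius of a well-scoring candidate circle.
    dx, dy:    half-ranges of the ellipse centers; dy > dx gives the larger
               vertical range "Y" compared with the horizontal range "X".
    dr:        half-range of each ellipse radius about the circle radius.
    """
    xs = range(xc - dx, xc + dx + 1, step)
    ys = range(yc - dy, yc + dy + 1, step)
    radii = range(max(1, r - dr), r + dr + 1, step)
    # Every combination of center and of x-axis and y-axis radii.
    return list(product(xs, ys, radii, radii))
```

With the default values, a circle centered at (16, 16) with
radius 8 yields 3 horizontal centers, 5 vertical centers,
and 3 values for each radius, or 135 candidate ellipses.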

The candidate ellipse 39 with the highest score is
then output 41 by an output top candidate block 34 to the
remainder of the system. The parameters provided by the
output top candidate block 34 are:

    center_x    horizontal location of ellipse center
    center_y    vertical location of ellipse center
    radius_x    x axis radius of ellipse
    radius_y    y axis radius of ellipse
    angle θ     tilt.

The tilt parameter is optional.

An alternative is to use circles throughout the
face detection module 10 and remainder of the system as
being sufficient matches to a face and output the parameters
of the circle. The parameters of a circle are:

    center_x    horizontal location of circle center
    center_y    vertical location of circle center
    radius      radius of circle.

It is to be understood that other parameters of
the ellipse or circle may likewise be provided, if desired,
such as a diameter which is representative of its respective
radius.

To track the face between successive frames,
another two frames of the video are obtained and cleaned by
the face detection module 10. The initial determination
block 25 determines that the initial face location has been
determined and passes the cleaned image 23 from the clean
difference image block 22 to a select candidate ellipses
block 30. The select candidate ellipses block 30 determines
a set of potential ellipses based on the previous top
candidate ellipse 41 from output top candidate block 34.
The set of potential ellipses is selected from a
substantially equal range of centers in both the horizontal
and vertical directions. There is no significant reason to
include the variability of the generate candidate ellipses
block 28 used for the initial face position because the
location of the face is already determined and most likely
has not moved much. Subsequent tracking involves following
the motion of the head itself, where motion is just as likely
in either the vertical or the horizontal direction, as
opposed to a determination of where the head is based on an
inaccurate circular model. The radii selected for the
candidate ellipses (x and y) are likewise in a range similar
to the previous top candidate ellipse 41 because it is
unlikely that the face has become substantially larger or
substantially smaller between frames that are not
significantly temporally different. Reducing the difference
in variability between the horizontal and vertical
directions reduces the computational requirements that would
have been otherwise required.

The candidate ellipses from the select candidate
ellipses block 30 are passed to the score candidate ellipses
block 32, as previously described. The output top candidate
block 34 outputs the candidate ellipse with the highest
score. The result is that after the initial determination
of the face location the face detection module 10 tracks the
location of the face within the video.

The face detection module 10 may be extended to
detect multiple faces. In such a case the output top
candidate block 34 would output a set of parameters for each
face detected.

Alternative face detection techniques may be used
to determine the location of the face within an image. In
such a case the output of the face detection module is
representative of the location of the face and its size
within a video.

If desired, a gaze detection module which detects
the actual viewer's eye position may be used to determine
the location of the region of interest to the viewer within
a video. This may or may not be a face.

The present inventors came to the realization that
the human eye has a sensitivity to image detail that is
dependent on the distance to the particular pixels of the
image and the visual angle to the particular pixels of the
image. Referring to FIG. 5, the system includes a
non-linear visual model 60 of the human eye to determine
appropriate weighting for each of the pixels or regions of
the image.

The visual model 60 calculates the sensitivity of
the human eye versus the location within the image.
Referring to FIG. 6, the visual model 60 initially
determines the relationship between a distance 62 on the
display 38 from the viewer's focus 64 and the resulting visual
angle 66 of the viewer 68 to the end of the distance 62.
The visual angle 66 will depend on the anticipated viewing
distance of the viewer. The angular relationship is
preferably specified in multiples of image heights or pixel
heights, as opposed to absolute distances. The angular
relationship is also preferably set for the particular
system based upon the expected viewing distance and
particular display 38. Alternatively, the angular
relationship could be determined by a sensor determining the
viewing distance together with information regarding the
particular display 38.

Referring to FIGS. 5 and 7, the visual model 60
calculates at block 62 an eccentricity in visual angle for
each pixel, location, or region as a function of the
distance 63 from the detected region boundary 65 of the face
from the output 41 of the output top candidate block 34.
The pixel distance from the region boundary is:



where θ_E is the eccentricity in units of visual angle, y is
the vertical pixel position in the image frame, and x is the
horizontal pixel position. The following four parameters
are the outputs from the face detection module 10: y_c, x_c,
y_r, and x_r, where x_c and y_c are the (x,y) center positions of
the selected ellipse in the frame, and x_r and y_r are the
elliptical radii in the horizontal and vertical directions,
respectively (i.e., the horizontal and vertical minor and
major axes are 2x_r and 2y_r, respectively). V is the viewing
distance in units of pixel distances (e.g., in viewing
an image with a height of 512 lines of pixels with a viewing
distance of 2 picture heights, V=2*512=1024).

Referring to FIG. 8, a graph of the eccentricity
(in visual angle) for a single pixel location for a series
of viewing distances, from 1 image height to 6 image
heights, is shown. The viewing location is the center of a
640 by 480 pixel display. For example, a viewer at a
distance of 6 image heights 70 observes 6 degrees
eccentricity in comparison to a larger 35 degrees of
eccentricity at a distance of 1 image height 72 when looking
at the edge of the display 38. It is noted that x_r and y_r
are both zero in FIG. 8.
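
The relationship between pixel distance, viewing distance, and eccentricity can be illustrated with a short sketch. It assumes that the eccentricity is the arctangent of the pixel distance outside the face ellipse divided by V, and that the distance to the boundary is measured along the ray from the ellipse center through the pixel; both are assumptions made for illustration, and the function name is hypothetical. With x_r and y_r set to zero, the values come out close to the figures quoted for FIG. 8.

```python
import math

def eccentricity_deg(x, y, xc, yc, xr, yr, viewing_distance_px):
    """Approximate eccentricity (degrees) of pixel (x, y).

    (xc, yc) is the ellipse center, (xr, yr) its radii, and
    viewing_distance_px is V expressed in pixels. Pixels inside the
    ellipse are treated as having zero eccentricity (assumption).
    """
    dx, dy = x - xc, y - yc
    dist_center = math.hypot(dx, dy)
    if dist_center == 0.0:
        return 0.0
    if xr > 0 and yr > 0:
        # Ellipse radius along the direction from the center to the pixel.
        cos_t, sin_t = dx / dist_center, dy / dist_center
        boundary = 1.0 / math.sqrt((cos_t / xr) ** 2 + (sin_t / yr) ** 2)
    else:
        boundary = 0.0  # point fixation, as in FIG. 8
    dist_boundary = max(0.0, dist_center - boundary)
    return math.degrees(math.atan(dist_boundary / viewing_distance_px))

# Edge of a 640 by 480 display viewed from its center, radii set to zero.
near = eccentricity_deg(0, 240, 320, 240, 0, 0, 1 * 480)  # 1 image height
far = eccentricity_deg(0, 240, 320, 240, 0, 0, 6 * 480)   # 6 image heights
```

Under these assumptions `near` is about 33.7 degrees and `far` about 6.3 degrees, which is roughly consistent with the 35 and 6 degrees read from the graph.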

The visual angle of the viewer to each pixel of
the image is then used as a basis of calculating, at block
63, the viewer's sensitivity to each pixel or block based on
a non-linear model of the human visual system. Referring to
FIG. 9 and the eccentricity calculation of FIG. 8, a set of
measured data sets 80 and 82 (actual data) for absolute
sensitivity of the human visual system is obtained across
all frequencies. The data sets 80 and 82 are used to
determine the maximum sensitivity to the frequency response
of the human visual system. A Cortical Magnification
Function (CMF) (shown below) fits the data well and provides
data set 84, which is a function of how many brain cells are
allocated to each visual field location. In essence, FIG. 9
illustrates a non-linear actual model of the sensitivity of
the human visual system as a function of eccentricity. The
sensitivity can be normalized for use in general rate
control, or used as an absolute value where visually lossless
quality is needed. Applying the sensitivity data of FIG. 9 to a
pixel image results in an image of the same size as the
original pixel image (or of a macro block sampled image) and
gives the visual sensitivity as a function of pixel
location, as shown in FIG. 10. The CMF equation governing
data set 84 and FIG. 10 is:



where S is the visual sensitivity, k_ECC is a constant
(preferred value is 0.24), and θ_E is the eccentricity in
visual angle as given above. The CMF equation
is referred to as the Cortical Magnification Function. The
result is a sensitivity image, or map, that can be
determined at any desired resolution with respect to the
starting image sequence. The CMF equation may also be
applied to the image where the viewer is observing any
arbitrary location 90, resulting in different sensitivity
values for the pixels. In the preferred embodiment, the
location 90 is the top candidate ellipse 41 for the
particular frame.
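
A sensitivity map of this kind can be sketched in a few lines. The sketch assumes the commonly used form of the Cortical Magnification Function, S = 1/(1 + k_ECC·θ_E) with θ_E in degrees; that form is an assumption here, chosen because it gives S = 1 at zero eccentricity and uses the constant 0.24 quoted above. The helper names and the one-dimensional cross section are illustrative only.

```python
import math

K_ECC = 0.24  # preferred constant quoted in the text

def sensitivity(theta_e_deg, k_ecc=K_ECC):
    """Sensitivity for an eccentricity in degrees (assumed CMF form)."""
    return 1.0 / (1.0 + k_ecc * theta_e_deg)

def sensitivity_row(width, focus_x, viewing_distance_px, k_ecc=K_ECC):
    """Cross section of the sensitivity map for a viewer fixating focus_x."""
    row = []
    for x in range(width):
        # Eccentricity of pixel x relative to the fixation point.
        theta = math.degrees(math.atan(abs(x - focus_x) / viewing_distance_px))
        row.append(sensitivity(theta, k_ecc))
    return row

# 640-pixel row, fixation at the center, viewed from 2 image heights (2 * 480).
row = sensitivity_row(640, 320, 2 * 480)
```

The value at the fixation point is 1 and the values fall off monotonically toward the edges of the row.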

FIG. 11 illustrates the resulting cross section
of the sensitivity values for an elliptical object with a
radius of 100, centered at position 96 (solid line) and at
position 98 (dashed line). It is also possible to use the
visual weighting of the image for multiple elliptical (or
other shaped) regions of importance. It is noted that the
sensitivity across the indicated facial region is
constant, namely 1.

It is to be understood that other non-linear
models based on the actual human visual system may likewise
be used to associate sensitivity information with pixels,
locations, or regions of an image.

The visual model 60 produces sensitivity
information as a function of the location within the image
in relation to a model of the human visual system. The
values preferably range from 0 to 1, where 1 is the most
sensitive. Referring again to FIG. 1, the image has a
sensitivity associated with each region, block, or pixel of
the image. The video frame 14 needs to be encoded by an
encoder 100 and then stored or transmitted with a
preselected target number of bits, suitable for the particular
system. The following description is based on a typical
block-based image encoder 100, but it is to be understood
that any other encoder may likewise be used, such as a
region or pixel based encoder.

Referring to FIG. 12, in a block-based image
encoding system, such as MPEG-1, MPEG-2, H.261, and H.263,
the image (frame) to be encoded is decomposed into a
plurality of image blocks 101 of the same size, typically of
16x16 pixels per block. The pixel values of each block are
transformed by a block transform 102 into a set of
coefficients, preferably by using a Discrete Cosine
Transform (DCT). The resulting coefficients are quantized
by a block quantizer 104 and then encoded by a coder 106.
The quantization of the transformed coefficients
determines the quality of the encoding of each image block
101. The quantization of the ith image block 101 is
controlled by only one parameter, Q_i, within the block
quantizer 104. In the H.261 and the H.263 video encoding
standards, Q_i is referred to as the quantization step for
the ith block and its value corresponds to half the step
size used for quantizing the transformed coefficients. In
the MPEG-1 and the MPEG-2 standards, Q_i is referred to as
the quantization scale and the jth coefficient of a block is
quantized using a quantizer of step size Q_i W_j, where W_j is
the jth value of a quantization matrix selected by the
designer of the MPEG codec. The H.261, H.263, MPEG-1, and
MPEG-2 standards are incorporated by reference herein.

The number of bits produced when encoding the ith
image block, B_i, is a function of the value of the
quantization parameter Q_i and the statistics of the block.
If Q_i is small, the image block is quantized more accurately
and the image block quality is higher, but such a fine
quantization produces a large number of bits (large B_i) for
the image block. Coarser quantization (large Q_i) produces
fewer bits (small B_i) but the image quality is
also lower.
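
The trade-off between step size and bit count can be illustrated with a toy quantizer. This is a sketch only: it applies a step size of Q_i W_j to each coefficient, as in the MPEG-style description above, and uses the count of non-zero quantized values as a rough stand-in for the number of bits. That proxy, the sample coefficients, and the flat weighting matrix are all assumptions for illustration.

```python
def quantize(coeffs, q_scale, weights):
    """Quantize coefficients with per-coefficient step size q_scale * w."""
    return [int(round(c / (q_scale * w))) for c, w in zip(coeffs, weights)]

def nonzero_count(levels):
    """Rough proxy for coding cost: number of non-zero quantized levels."""
    return sum(1 for v in levels if v != 0)

coeffs = [120.0, -45.0, 18.0, 7.0, -3.0, 1.0]  # illustrative transform output
weights = [1.0] * len(coeffs)                  # flat matrix (assumption)

fine = quantize(coeffs, 2.0, weights)     # small Q: more levels survive
coarse = quantize(coeffs, 16.0, weights)  # large Q: more levels become zero
```

Here the coarser step leaves fewer non-zero levels to encode, at the cost of a larger reconstruction error.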

In image coding, the image blocks are said to be
intracoded, or of class intra. In video encoding, many of
the blocks in a frame are similar to corresponding blocks in
previous frames. Video systems typically predict the value
of the pixels in the current block from previously encoded
blocks and only the difference or prediction error is
encoded. Such predicted blocks are said to be intercoded,
or of the class inter. The techniques described herein are
suitable for intra, inter, or both intra and inter block
encoding techniques.

Referring to FIG. 13, a set of quantization steps
Q_j versus block number j for one row of blocks in a frame is
shown. There are three different video coding strategies
discussed below. Each technique is first briefly discussed,
then the latter two are discussed in greater detail.

FIRST VIDEO CODING STRATEGY
The first strategy is represented by line 120,
which uses the same quantization value Q for all the blocks
in the row. This may be referred to as the fixed-Q method.
The resulting number of bits to encode the row of blocks is
referred to as B.

SECOND VIDEO CODING STRATEGY
The second strategy is represented by the
staircased line 122. Q_j is set to Q for the block closest
to the location where the system has determined that the
viewer is observing, such as the face region. In FIG. 13,
the viewing location is shown as the middle of the row.
The Q_j's are selected to be larger than Q for blocks farther
from the center. Since all the quantization steps are as
large as or larger than those for the fixed-Q strategy 120,
the staircased line 122 technique will encode the blocks in
the row with fewer bits. The resulting number of bits
necessary to encode the row of blocks using the staircased
line 122 technique is referred to as B', where B'<B. With
the proper selection of Q_j for each block the image quality
will appear uniform to the human eye, as described in detail
below. Accordingly, the perceived quality of the encoded
images using line 120 or line 122 will be the same, but, as
mentioned above, using line 122 will produce fewer bits.

THIRD VIDEO CODING STRATEGY

If the quantization steps of line 122 are reduced
by a constant, the number of bits necessary to encode the
blocks will be greater than B'. The staircased line 124
represents the steps Q_j' used for encoding the blocks
resulting in the same number of bits B as the line 120. The
blocks of the entire row will be perceived by the viewer as
having the same image quality, with the proper selection of
the Q_j' values. The center is quantized with a step size of
Q'<Q, resulting in the image at the center having a
better quality than with the fixed-Q technique. Hence, the
perceived image quality to the viewer of the entire row,
which is substantially uniform, will be higher than the
fixed-Q case, even though both techniques use the same
number of bits B. The objective of the third strategy
(staircased line 124) is to find the proper Q_j' values
automatically, so that the preselected target number of
bits (in this case B) is achieved.

DETAILS OF SECOND STRATEGY
The present inventors came to the realization that
a coarser quantization on image blocks to which the viewer
is less sensitive can be performed without affecting the
perceived image quality. In fact, when encoding digital
video, the quantization factor can be increased according to
the sensitivities of the human visual system and thereby
decrease the number of bits necessary for each frame. In
particular, if the entire N blocks of the image are
quantized and encoded with quantization steps:

$$Q/S_1,\; Q/S_2,\; \ldots,\; Q/S_N \qquad \text{Equation 1}$$

respectively, where S_k is the sensitivity associated with the
kth block, the perceived quality of the encoded frame will
be the same as if all the blocks were quantized with step
size Q. Since the S_k's are smaller than or equal to 1, the
resulting quantizers in Equation 1 will be as large as or
larger than Q, and therefore will produce fewer bits when
encoding a given frame.

To summarize, an encoding scheme in which the
sensitivities are representative of the perceived image
quality based on a model of the human visual system, and in
which the quantization factor is varied with respect to
the sensitivity information, provides an image that has a
perceived uniform quality. This also provides a minimum bit
rate with the uniform quality.

The following steps may be used to reduce the
number of bits for a video frame using a preselected base
quantization step size Q.

STEP 1. Initially set k equal to 1.

STEP 2. Find the maximum value of the sensitivity
for the pixels in the kth block, S_k,

$$S_k = \max(S_{k,1}, S_{k,2}, S_{k,3}, \ldots, S_{k,L}) \qquad \text{Equation 2}$$

where S_{k,i} is the sensitivity for the ith pixel in the kth
block. Alternatively, the maximum operation could be
replaced by any other suitable evaluation of the
sensitivities of a block, such as the average of the
sensitivities.

STEP 3. Encode the kth block with a quantizer of
step size Q/S_k.

STEP 4. If k<N, then let k=k+1 and go to Step 2.
Otherwise stop.
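
The steps above can be sketched as follows. The function names and the nested-list representation of the sensitivity map are illustrative assumptions; the per-block reduction defaults to the maximum of Equation 2 and can be swapped for an average, as the text permits. Sensitivities are assumed to be strictly positive.

```python
def block_sensitivities(sens_map, block, reduce=max):
    """Reduce a per-pixel sensitivity map to one value S_k per block.

    sens_map is a list of rows; blocks are visited in raster order.
    """
    rows, cols = len(sens_map), len(sens_map[0])
    out = []
    for by in range(0, rows, block):
        for bx in range(0, cols, block):
            vals = [sens_map[y][x]
                    for y in range(by, min(by + block, rows))
                    for x in range(bx, min(bx + block, cols))]
            out.append(reduce(vals))
    return out

def perceptual_steps(block_sens, base_q):
    """Step size Q / S_k for each block (Equation 1)."""
    return [base_q / s for s in block_sens]

# Tiny 2x4 map split into two 2x2 blocks, base step Q = 8.
sens_map = [[1.0, 0.8, 0.5, 0.4],
            [0.9, 0.7, 0.3, 0.2]]
s_k = block_sensitivities(sens_map, block=2)
steps = perceptual_steps(s_k, base_q=8.0)
```

In this toy case the block containing the fixation point keeps the base step of 8, while the block with maximum sensitivity 0.5 is quantized with a step of 16.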
DETAILS OF THIRD STRATEGY

In many systems the total number of bits available
for encoding a video frame is set in advance by the
user or the communication channel. Consequently, some rate
or quantizer control strategy is necessary for selecting the
value of the quantization steps so that the frame target is
achieved, as suggested by line 124 of FIG. 13. In other
words, when the number of bits is fixed in advance, the
aforementioned base Q will likely not match the available
bandwidth.

A model for the number of bits invested in the ith
image block is:

$$B_i = A\left(K\,\frac{\sigma_i^2}{Q_i^2} + C\right) \qquad \text{Equation 3}$$

where Q_i is the quantizer step size or quantization scale, A
is the number of pixels in a block (e.g., in MPEG and H.263
A=16^2 pixels), and K and C are model parameters (described
below). σ_i is the empirical standard deviation of the
pixels in the block, and is defined as:

$$\sigma_i = \sqrt{\frac{1}{A}\sum_{j=1}^{A}\left(P_i(j) - \bar{P}_i\right)^2} \qquad \text{Equation 4}$$

with P_i(j) the value of the jth pixel in the ith block and
P̄_i the average of the pixel values in the block. P̄_i is
defined as,

$$\bar{P}_i = \frac{1}{A}\sum_{j=1}^{A} P_i(j) \qquad \text{Equation 5}$$

For a color image, the P_i(j)'s are the values of the
luminance and chrominance components for the block pixels.
The model of Equation 3 was derived using a rate-distortion
analysis of the block's encoder and is discussed in greater
detail in co-pending United States Patent Application Serial
No. 09/008,137, filed January 16, 1998, incorporated by
reference herein.
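
The block statistics and the bit model can be sketched as below. Two assumptions are made for illustration: the standard deviation is normalized by the number of pixels A, and the bit model takes the form B_i = A(K σ_i²/Q_i² + C), which is consistent with the parameters A, K, and C described here. The function names and the sample values of K and C are hypothetical.

```python
import math

def block_std(pixels):
    """Empirical standard deviation of the pixel values in one block."""
    a = len(pixels)
    mean = sum(pixels) / a
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / a)

def predicted_bits(sigma, q, A=256, K=0.5, C=0.0):
    """Bits predicted for a block under the assumed model of Equation 3."""
    return A * (K * sigma ** 2 / q ** 2 + C)

flat = block_std([7, 7, 7, 7])       # no texture
textured = block_std([0, 2, 0, 2])   # alternating pattern
bits = predicted_bits(sigma=10.0, q=5.0)
```

A flat block has zero deviation and so costs only the overhead term A·C under this model, while doubling the step size divides the texture term by four.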

K and C are model parameters. K depends on the 
encoder efficiency and the distribution of the pixel values, 
and C is the number of bits for encoding overhead 
information (e.g., motion vectors, syntax elements, etc.). 
Preferably, the values of K and C are not known in advance 
and are estimated during encoding. 

The objective of the third technique is to find
the value of the quantization steps that satisfy the
following two conditions:

(1) the total number of bits produced for the
image is a preselected target B; and

(2) the overall image quality is perceived as
homogeneous, constant, or uniform.

Let N be the number of blocks in the video frame. The first
condition in terms of the encoder model is:

$$B = \sum_{i=1}^{N} A\left(K\,\frac{\sigma_i^2}{Q_i^2} + C\right) \qquad \text{Equation 6}$$

As described in relation to the second strategy, the second
condition is satisfied by a set of quantizers,

$$Q'/S_1,\; Q'/S_2,\; \ldots,\; Q'/S_N \qquad \text{Equation 7}$$

where (Q'/S_k)=Q_k is the quantization step of the kth block,
but now Q' is not known.

Combining Equations 6 and 7 the following equation is
obtained:

$$B = \sum_{k=1}^{N} A\left(K\,\frac{\sigma_k^2 S_k^2}{Q'^2} + C\right) \qquad \text{Equation 8}$$

The following expression for Q' is obtained from Equation 8.

$$Q' = \sqrt{\frac{A K}{B - A N C}\sum_{k=1}^{N}\sigma_k^2 S_k^2} \qquad \text{Equation 9}$$

Equation 9 is the basis for the preferred rate control
technique, described below.

The quantizers for encoding the N image blocks in
a frame are preferably selected with the following
technique.

STEP 1. Initialization. Let i=1 (first block),
B_1=B (available bits), N_1=N (number of blocks). Let

Equation 10

where σ_k and S_k are defined in Equations 4 and 2,
respectively. If the values of the parameters K and C in
the encoder model are known or estimated in advance, e.g.,
using linear regression, let K_1=K and C_1=C. If the model
parameters are not known, set K_1 and C_1 to some small
non-negative values, such as K_1=0.5 and C_1=0, as initial
estimates. In video coding, one could set K_1 and C_1 to the
values K_{N+1} and C_{N+1}, respectively, from the previous encoded
frame, or any other suitable value.

STEP 2. The quantization parameter for the ith
block is computed as follows:

Equation 11

If the values of the Q-parameters are restricted to a fixed
set (e.g., in H.263, Q_i=2QP and QP takes values in
{1,2,3,...,31}), round Q_i to the nearest value in the set.
The square root operation can be implemented using look-up
tables.

STEP 3. The ith block is encoded with a
block-based coder, such as the coder of FIG. 12. Let B'_i be the
number of bits used to encode the ith block, compute

Equation 12

STEP 4. The parameters K_{i+1} and C_{i+1} of the coder
model are updated. For the fixed mode K_{i+1}=K and C_{i+1}=C. For
the adaptive mode, K_{i+1} and C_{i+1} are determined using any
suitable technique for model fitting. For example, one
could use the model fitting techniques in co-pending United
States Patent Application, Serial No. 09/008,137,
incorporated by reference herein.

STEP 5. If i=N, stop (all image blocks are
encoded). If not, i=i+1 and go to Step 2.
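
The block-by-block procedure can be sketched in its fixed mode, where K and C stay constant. The sketch rests on several assumptions that are not confirmed by the text: the bit model is taken to be B_i = A(K σ_i²/Q_i² + C); the quantity initialized in Step 1 is taken to be the sum of σ_k² S_k² over the blocks still to be encoded; and after each block the remaining bits, remaining blocks, and remaining sum are decremented. The names `rate_control`, `encode_block`, and `q_max` are hypothetical.

```python
import math

def rate_control(sigmas, sens, target_bits, encode_block,
                 A=256, K=0.5, C=0.0, q_max=62.0):
    """Choose a step size per block so the frame lands near target_bits.

    encode_block(i, q) is a caller-supplied function that encodes block i
    with step q and returns the bits actually used (hypothetical interface).
    """
    n_left = len(sigmas)
    bits_left = float(target_bits)
    # Weighted texture of all blocks still to be encoded (assumed form).
    energy_left = sum((sg * s) ** 2 for sg, s in zip(sigmas, sens))
    steps, used = [], []
    for i, (sg, s) in enumerate(zip(sigmas, sens)):
        budget = bits_left - A * n_left * C  # bits left after overhead
        if budget > 0 and energy_left > 0:
            q = min(q_max, math.sqrt(A * K * energy_left / budget) / s)
        else:
            q = q_max  # budget exhausted: fall back to the coarsest step
        bits = encode_block(i, q)
        steps.append(q)
        used.append(bits)
        # Update the counters for the remaining blocks.
        bits_left -= bits
        energy_left -= (sg * s) ** 2
        n_left -= 1
    return steps, used
```

If the encoder behaved exactly as the assumed model predicts, the product of each step and its block sensitivity would stay constant across the frame and the total would match the target; with a real coder the per-block update absorbs the mismatch.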

ENCODER SYSTEM
Referring to FIG. 14, an encoder system 200
includes an input image sequence 202 which is passed to the
encoder 100 and a sub-sample block 204 which decomposes the
image sequence 202 into macro-blocks. The macro-blocks from
the sub-sample block 204 are passed to the visual model 60
and the face detection module 10. An optional gaze
direction measurement block 206 detects the location of the
gaze of the viewer. The output from either the measurement
block 206 or detection module 10 is passed to the visual
model 60 and optionally to an encode gaze parameters block
208. Calibration parameters for the pixel size and/or
viewing distance are provided to the visual model 60 and the
encode gaze parameters block 208 by a calibration block 210.
The visual model 60 provides its sensitivity output to the
encoder 100. The encoder 100 thereafter transmits encoded
data to a storage device or a decoder 300.

Referring to FIG. 15, the decoder 300 decodes the
gaze parameters with a decode gaze parameters block 302. A
visual model block 304 calculates both eccentricity versus
image location and sensitivity versus eccentricity. The
visual model block 304 provides quantization parameters to
the decode data block 306 which decodes the encoded data
based on the quantization parameters. An inverse transform
block 308 decomposes the data from the decode data block 306
to obtain a decompressed image sequence 310 for use, such as
being displayed on a display.
The terms and expressions which have been employed
in the foregoing specification are used therein as terms of
description and not of limitation, and there is no
intention, in the use of such terms and expressions, of
excluding equivalents of the features shown and described or
portions thereof, it being recognized that the scope of the
invention is defined and limited only by the claims which
follow.

CLAIMS




1. A method of detecting a facial region within
a video comprising the steps of:
(a) receiving a first frame of said video
comprising a plurality of pixels;
(b) receiving a subsequent frame of said video
comprising a plurality of pixels;
(c) calculating a difference image representative
of the difference between a plurality of said
pixels of said first frame and a plurality of
said pixels of said subsequent frame;
(d) determining a plurality of candidate facial
regions within said difference image based on
a transform of said difference image in a
spatial domain to a parameter space; and
(e) fitting said plurality of candidate facial
regions to said difference image to select
one of said candidate facial regions.


2. The method of claim 1 further comprising the
step of thresholding said difference image thereby removing
values of said difference image less than a threshold value.



3. The method of claim 2 wherein said threshold
value is a predetermined value and said removing values is
setting said values of said difference image that are less
than said threshold value to a selected value.



4. The method of claim 1 wherein said transform
is a Hough transform.

5. The method of claim 4 wherein said Hough
transform is

$$A(x_c, y_c, r) = A(x_c, y_c, r) + 1 \quad \forall\, x_c, y_c, r \in (x - x_c)^2 + (y - y_c)^2 = r^2.$$

6. The method of claim 1 where said fitting of
each of said candidate facial regions is based on a
combination of at least three factors including, a fit
factor representative of a fit of said candidate facial
regions to said difference image, a location factor
representative of the location of said candidate facial
regions within said video, and a size factor representative
of the size of said candidate facial regions.



7. The method of claim 1 further comprising the
step of scaling said first frame and said subsequent frame
of said video to reduce the number of said pixels of said
first and subsequent frame prior to said calculating said
difference frame.



8. The method of claim 1 wherein said step of
determining said plurality of said candidate facial regions
and fitting said plurality of said candidate facial regions
further comprises the steps of:
(a) determining a set of candidate circles based
on a Hough transform of said difference
image;
(b) scoring said set of said candidate circles
based on a combination of at least three
factors including, a fit factor
representative of the fit of said candidate
circles to said difference image, a location
factor representative of the location of said
candidate circles within said video, and a
size factor representative of the size of
said candidate circles;
(c) selecting at least one of said candidate
circles based on said scoring;
(d) generating at least one candidate facial
region having an elliptical shape for each of
said at least one of said candidate circles;
and
(e) scoring each of said candidate facial regions
based on a combination of at least three
factors including, a fit factor
representative of the fit of a respective
said candidate facial region to said
difference image, a location factor
representative of the location of said
respective said candidate facial region
within said video, and a size factor
representative of the size of said respective
said candidate facial region.

9. The method of claim 8 wherein said generating
at least one candidate facial region has a center of said
elliptical shape located within a bounded region of
potential locations having a greater vertical dimension than
a horizontal dimension centered about the center of said
respective said candidate circle.

10. A method of detecting a facial region within
a video comprising the steps of:
(a) receiving a first frame of said video
comprising a plurality of pixels;
(b) receiving a subsequent frame of said video
comprising a plurality of pixels;
(c) calculating a difference frame representative
of the difference between a plurality of said
pixels of said first frame and a plurality of
said pixels of said subsequent frame;
(d) determining a plurality of candidate facial
regions within said difference frame; and
(e) fitting said candidate facial regions to said
difference image to select one of said
candidate facial regions based on a
combination of at least two of the following
three factors including, a fit factor
representative of the fit of said candidate
facial regions to said difference image, a
location factor representative of the
location of said candidate facial regions
within said video, and a size factor
representative of the size of said candidate
facial regions.


11. The method of claim 10 where said determining
said candidate facial regions is based on a Hough transform
of said difference image in a spatial domain to a parameter
space.

12. The method of claim 11 wherein said Hough
transform is

$$A(x_c, y_c, r) = A(x_c, y_c, r) + 1 \quad \forall\, x_c, y_c, r \in (x - x_c)^2 + (y - y_c)^2 = r^2.$$



13. The method of claim 10 further comprising the
step of thresholding said difference image thereby removing
values of said difference image less than a threshold value.

14. The method of claim 10 further comprising the
step of scaling said first frame and said subsequent frame
of said video to reduce the number of said pixels of said
first and subsequent frame prior to said calculating said
difference frame.

15. A method of determining sensitivity
information for a video comprising the steps of:
(a) receiving a first frame of said video;
(b) receiving a subsequent frame of said video;
(c) determining a location of a facial region
within said video based on said first and
subsequent frames; and
(d) calculating a sensitivity value for each of a
plurality of locations within said video
based upon both said location of said facial
region within said video in relation to said
plurality of locations and a non-linear model
of the sensitivity of a human visual system.

16. The method of claim 15 wherein the step of
said calculating said sensitivity values is further based
upon calculating an eccentricity versus image location in
relation to a viewer of said video for said plurality of
locations within said video.

17. The method of claim 16 wherein said
calculating said sensitivity is further based upon a
sensitivity versus eccentricity non-linear model of said
human visual system.

18. The method of claim 16 wherein said
eccentricity is derived according to the following,



where θ_E is said eccentricity, y is a vertical pixel
position within said video, x is a horizontal position
within said video, x_c represents a horizontal component of a
center position of an elliptical said facial region, y_c
represents a vertical component of said center position of
said elliptical said facial region, x_r represents a first
elliptical radius of said elliptical said facial feature in a
horizontal direction, y_r represents a second elliptical
radius of said elliptical said facial feature in a vertical
direction, and V represents a viewing distance.

19. The method of claim 15 wherein said 
sensitivity values are based upon the distance from the 
outer edge of said facial region to said plurality of 
locations within said video. 

20. The method of claim 17 wherein said
sensitivity versus eccentricity non-linear model is derived
according to the following,



where S is representative of said sensitivity, k_ECC is a
constant, and θ_E is representative of a non-linear contrast
sensitivity function.

21. A method of encoding a video comprising the
steps of:
(a) receiving a frame of said video consisting of
a plurality of pixels;
(b) calculating sensitivity information for a
plurality of locations within said video
calculated based upon the sensitivity of a
human visual system of a viewer observing a
particular region of said video; and
(c) encoding said frame in a manner that provides
a substantially uniform apparent quality of
said plurality of locations to said viewer
when said viewer is observing said particular
region of said video.

22. The method of claim 21 wherein said encoding 
of each of said plurality of locations is based on a 
respective quantization value representative of a base 
quantization factor divided by said sensitivity information 
for a respective one of said plurality of locations. 

23. The method of claim 22 wherein said encoding
is derived in accordance with the following:

$$Q/S_1,\; Q/S_2,\; Q/S_3,\; \ldots,\; Q/S_N$$

where Q is representative of said base quantization factor,
and S_1 through S_N are representative of said sensitivity
information for said plurality of locations.

24. The method of claim 23 wherein one of said
S_k, where k is a value from 1 to N, is derived based upon a
statistical calculation of a plurality of said sensitivity
information for one of said locations of said image.

25. The method of claim 24 wherein S_k is an
average of said plurality of said sensitivity information.



26. The method of claim 24 wherein S_k is a
maximum of said plurality of said sensitivity information.

27. The method of claim 21 wherein said encoding
said frame results in the total number of bits produced for
said frame being substantially equal to a preselected
number.

28. The method of claim 27 wherein said frame is 
encoded only once . 

29. The method of claim 27 wherein said encoding 
of each of said plurality of locations is based on a 
respective quantization value representative of a base 
quantization factor divided by said sensitivity information 
for a respective one of said plurality of locations. 

30. The method of claim 29 wherein said base
quantization factor is derived in accordance with the
following:



where A is representative of the number of pixels in one of
said plurality of locations, K and C are constants
associated with said plurality of locations, N is
representative of the number of said plurality of locations,
B is representative of said total number of bits, the σ_i²
values are a measure of how much texture is associated with
said plurality of locations, and the S_i² values are
representative of the respective said sensitivity
information squared.



yt. A method for encoding multiple blocks in a 
frame of image data, comprising: 

(a) identifying a target bit value equal to a 

total number of bits available for encoding 
the frame; 




(b) calculating sensitivity information for each 
one of the blocks based upon the sensitivity 
of a human visual system observing a 
particular region of the image; 

(c) adapting quantization values for each of the 
multiple blocks to provide substantially 
uniform apparent quality of each of the 
blocks in the frame subject to a constraint 
that the total number of bits available for 
encoding the frame is equal to the target bit 
value; and 

(d) encoding the blocks with the quantization 
values.

32. The method of claim 31 wherein the 
quantization values are derived according to the following, 



where Q_i is the quantization value for each block i, N is
the number of blocks in the frame, B is the total number of
bits available for encoding the frame, A is a number of
pixels in each of the multiple blocks, K and C are constants
associated with the image blocks, σ_i is an empirical
standard deviation of pixel values in the block, and S_i is a
weighting incorporating the sensitivity information for the
block.

33. The method of claim 31 including adjusting 
the quantization values according to a number of image 
blocks remaining to be encoded, a number of bits still 
available for encoding the remaining image blocks, and a 
value that depends on the sensitivity and texture of the 
remaining image blocks. 
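
One way such an adjustment could work is sketched below in Python. The sketch recomputes the quantizer before each block from the blocks still to be encoded, the bits still available, and the remaining sum of texture times squared sensitivity. The bit model A * (K * texture / q^2 + C), the callback `spend`, and the function name are all assumptions made for illustration; they are not taken from the claims.

```python
import math

def quantizers_single_pass(variances, sensitivities, total_bits, A, K, C, spend):
    """Choose each block's quantizer from what is still left to encode.

    spend(i, q_i) is a HYPOTHETICAL callback that encodes block i with
    quantizer q_i and returns the number of bits actually produced.
    Sensitivities are assumed to be strictly positive.
    """
    weights = [v * s * s for v, s in zip(variances, sensitivities)]
    bits_left = float(total_bits)
    weight_left = sum(weights)
    blocks_left = len(weights)
    chosen = []
    for i, s in enumerate(sensitivities):
        # Bits left for texture once the assumed per-block overhead is set
        # aside; the floor keeps the sketch defined if the budget is overspent.
        texture_left = max(bits_left - A * blocks_left * C, 1.0)
        base_q = math.sqrt(A * K * max(weight_left, 0.0) / texture_left)
        q_i = base_q / s
        chosen.append(q_i)
        bits_left -= spend(i, q_i)
        weight_left -= weights[i]
        blocks_left -= 1
    return chosen
```

If the encoder spends exactly what the assumed model predicts, the base quantizer stays constant from block to block; when a block costs more or less than predicted, the later blocks absorb the difference.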

34. The method of claim 32 including using a K
parameter and a C parameter on a block-by-block basis to
adjust the quantization values for each of the multiple
blocks, the K parameter modeling correlation statistics of
the pixels in the image blocks and the C parameter modeling
bits required to code overhead data.

35. The method of claim 34 including deriving the
optimum quantization values in either a fixed mode where the
K and C parameters are known in advance or an adaptive mode
where the K and C parameters are derived according to the K
and C parameters of previously encoded blocks.



36. The method of claim 35 wherein the adaptive
mode includes the following steps:

(a) deriving values for the K and C parameters
that exactly predict the number of bits B
used for encoding previous blocks;

(b) deriving averages for the derived K and C
parameters for the previously encoded video
blocks; and

(c) predicting the K and C parameters for a next
video block by weighting the average K and C
parameters according to the initial estimates
for the K and C parameters.
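
The three steps above can be sketched in Python as follows. The sketch assumes that the encoder reports texture bits and overhead bits separately for each encoded block, that bits follow A * (K * variance / q^2 + C), and that the averages are blended linearly with the initial estimates as the frame progresses. These assumptions, and the name `predict_k_c`, are illustrative and do not come from the claims.

```python
def predict_k_c(history, k_init, c_init, total_blocks, A):
    """Predict K and C for the next block.

    history holds one (texture_bits, overhead_bits, q, variance) tuple per
    block already encoded; the tuple layout is an assumption of this sketch.
    """
    # (a) per-block values that reproduce the observed bits under the model
    k_vals = [tb * q * q / (A * var) for tb, _, q, var in history if var > 0]
    c_vals = [ob / A for _, ob, _, _ in history]
    # (b) averages over the blocks encoded so far
    k_avg = sum(k_vals) / len(k_vals) if k_vals else k_init
    c_avg = sum(c_vals) / len(c_vals) if c_vals else c_init
    # (c) rely on the averages more as more of the frame has been encoded
    w = len(history) / total_blocks
    return w * k_avg + (1 - w) * k_init, w * c_avg + (1 - w) * c_init
```

With an empty history the function returns the initial estimates unchanged, which corresponds to starting a frame with no measurements.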



37. A method for encoding video comprising the
steps of:

(a) detecting the location of a facial region of
a frame of said video;

(b) calculating a sensitivity value for each of a
plurality of locations within said video
based upon said location of said facial
region; and

(c) encoding said frame in a manner that provides a
substantially uniform apparent quality of
said plurality of locations to said viewer
when said viewer is observing said facial
region of said video.



38. The method of claim 37 wherein said 
sensitivity values are calculated based upon said location 
of said facial region, a size of said facial region, and a 
non-linear model of the sensitivity of a human visual
system. 



39. The method of claim 37 wherein said detecting
said location of said facial region of said frame further
comprises the steps of:

(a) receiving a first frame of said video
comprising a plurality of pixels;

(b) receiving a subsequent frame of said video
comprising a plurality of pixels;

(c) calculating a difference image representative
of the difference between a plurality of said
pixels of said first frame and a plurality of
said pixels of said subsequent frame;

(d) determining a plurality of candidate regions
within said difference image; and

(e) fitting said plurality of candidate regions
to said difference image to select said
facial region.



40. The method of claim 39 wherein said
determining said plurality of candidate regions is based on
a Hough transform of said difference image in a spatial
domain to a parameter space.

41. The method of claim 39 further comprising the
step of thresholding said difference image thereby removing
values of said difference image less than a threshold value.

42. The method of claim 41 wherein said threshold
value is a predetermined value and said removing values is
setting said values less than said threshold value to a
selected value.
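
The difference image and thresholding steps can be illustrated with a short Python sketch. Frames are represented as lists of rows of grayscale values, and the difference is taken as an absolute value; both choices, along with the function names, are assumptions made for illustration.

```python
def difference_image(first, subsequent):
    """Per-pixel absolute difference of two equally sized grayscale frames.
    Using the absolute value is an assumption of this sketch."""
    if len(first) != len(subsequent) or any(
            len(a) != len(b) for a, b in zip(first, subsequent)):
        raise ValueError("frames must have the same dimensions")
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, subsequent)]

def threshold(diff, threshold_value, selected_value=0):
    """Set every value below threshold_value to selected_value."""
    return [[v if v >= threshold_value else selected_value for v in row]
            for row in diff]
```

Small differences caused by noise are replaced by the selected value, so only pixels that changed noticeably between the two frames remain for the later region search.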

43. The method of claim 40 wherein said Hough
transform is

$$A(x_c, y_c, r) = A(x_c, y_c, r) + 1 \quad \forall\, x_c, y_c, r \in (x - x_c)^2 + (y - y_c)^2 = r^2.$$
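
A brute-force Python sketch of a circular Hough accumulator of this kind is given below. For every retained pixel (x, y) it increments the accumulator cell of each candidate centre and radius whose circle passes through that pixel, with the circle equation discretized by rounding the distance to the nearest integer radius. The discretization, the exhaustive search over centres, and the function names are assumptions of the sketch.

```python
import math
from collections import defaultdict

def hough_circle_accumulator(pixels, width, height, radii):
    """Vote for (x_c, y_c, r) cells consistent with each retained pixel.

    pixels is an iterable of (x, y) positions that survived thresholding;
    radii is the set of integer radii to consider.
    """
    radii = set(radii)
    acc = defaultdict(int)
    for x, y in pixels:
        for yc in range(height):
            for xc in range(width):
                # Discretized form of (x - xc)^2 + (y - yc)^2 = r^2
                r = round(math.hypot(x - xc, y - yc))
                if r in radii:
                    acc[(xc, yc, r)] += 1
    return acc

def strongest_circle(acc):
    """Return the (x_c, y_c, r) cell with the most votes."""
    return max(acc, key=acc.get)
```

Cells with many votes correspond to circles supported by many difference-image pixels, which makes them natural candidates for the later scoring steps. The exhaustive loop is chosen for clarity; a practical encoder would restrict the centre and radius ranges.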

44. The method of claim 39 wherein said fitting of
each of said candidate regions is based on a combination of
at least two of the following three factors including, a fit
factor representative of a fit of said candidate regions to
said difference image, a location factor representative of
the location of said candidate regions within said video,
and a size factor representative of the size of said
candidate regions.



45. The method of claim 39 further comprising the 
step of scaling said first frame and said subsequent frame 
of said video to reduce the number of said pixels of said
first and subsequent frame prior to said calculating said 
difference frame. 



46. The method of claim 39 wherein said step of
determining said plurality of said candidate regions and
fitting said plurality of said candidate regions further
comprises the steps of:

(a) determining a set of candidate circles based
on a Hough transform of said difference
image;

(b) scoring said set of said candidate circles
based on a combination of at least three
factors including, a fit factor
representative of the fit of said candidate
circles to said difference image, a location
factor representative of the location of said
candidate circles within said video, and a
size factor representative of the size of
said candidate circles;

(c) selecting at least one of said candidate
circles based on said scoring;

(d) generating at least one candidate region
having an elliptical shape for each of said
at least one of said candidate circles; and

(e) scoring each of said candidate regions based
on a combination of at least three factors
including, a fit factor representative of the
fit of a respective said candidate region to
said difference image, a location factor
representative of the location of said
respective said candidate region within said
video, and a size factor representative of
the size of said respective said candidate
region.


47. The method of claim 46 wherein said
generating at least one candidate region has a center of
said elliptical shape located within a bounded region of
potential locations having a greater vertical dimension than
a horizontal dimension centered about the center of said
respective said candidate circle.

48. The method of claim 38 wherein
the step of said calculating said sensitivity values is
further based upon calculating an eccentricity versus image
location in relation to a viewer of said video for said
plurality of locations within said video.

49. The method of claim 48 wherein said 
eccentricity is derived according to the following,






where θ_E is said eccentricity, y is a vertical pixel
position within said video, x is a horizontal position
within said video, x_c represents a horizontal component of a
center position of an elliptical said facial region, y_c
represents a vertical component of said center position of
said elliptical said facial region, x_r represents a first
elliptical radius of said elliptical said facial feature in a
horizontal direction, y_r represents a second elliptical
radius of said elliptical said facial feature in a vertical
direction, and V represents a viewing distance.

50. The method of claim 38 wherein said
sensitivity values are based upon the distance from the
outer edge of said facial region to said plurality of
locations within said video.

51. The method of claim 38 wherein said
sensitivity versus eccentricity non-linear model is derived
according to the following,



where S is representative of said sensitivity, k_ECC is a
constant, and θ_E is representative of a non-linear contrast
sensitivity function.
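
How an eccentricity and a sensitivity value might be computed for a pixel can be sketched in Python. The geometry and the fall-off used here are assumptions chosen for illustration and are not the claimed formulas: eccentricity is taken as the visual angle subtended by the distance from the pixel to the edge of the face ellipse (measured along the ray from the ellipse centre, with the viewing distance expressed in pixels), and sensitivity is taken as 1 / (1 + k_ECC * eccentricity). The default value of `k_ecc` is hypothetical.

```python
import math

def eccentricity_deg(x, y, xc, yc, xr, yr, viewing_distance):
    """Approximate visual angle in degrees between pixel (x, y) and the edge
    of the face ellipse. ASSUMED geometry; all distances are in pixels."""
    dx, dy = x - xc, y - yc
    rho = math.hypot(dx / xr, dy / yr)  # equals 1.0 on the ellipse boundary
    if rho <= 1.0:
        return 0.0                       # inside the face: treated as fixated
    edge_distance = math.hypot(dx, dy) * (1.0 - 1.0 / rho)
    return math.degrees(math.atan2(edge_distance, viewing_distance))

def sensitivity(theta_deg, k_ecc=0.24):
    """ASSUMED fall-off: 1 at fixation, decreasing as eccentricity grows."""
    return 1.0 / (1.0 + k_ecc * theta_deg)
```

Pixels inside the facial region receive the highest sensitivity, and sensitivity decreases smoothly with distance from the region's outer edge, so locations far from the face can tolerate coarser quantization.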


52. The method of claim 37 further comprising the
step of encoding said frame in a manner that provides a
substantially uniform apparent quality of said plurality of
locations to said viewer when said viewer is observing said
facial region of said video.

53. The method of claim 37 wherein said encoding
of each of said plurality of locations is based on a
respective quantization value representative of a base
quantization factor divided by said sensitivity information
for a respective one of said plurality of locations.

54. The method of claim 53 wherein said encoding
is derived in accordance with the following:

$$Q/S_1,\; Q/S_2,\; Q/S_3,\; \ldots,\; Q/S_N$$

where Q is representative of said base quantization factor,
and S_1 through S_N are representative of said sensitivity
information for said plurality of locations.

55. The method of claim 54 wherein one of said
S_k, where k is a value from 1 to N, is derived based upon a
statistical calculation of a plurality of said sensitivity
information for one of said locations of said image.

56. The method of claim 55 wherein S_k is an
average of said plurality of said sensitivity information.

57. The method of claim 55 wherein S_k is a
maximum of said plurality of said sensitivity information. 
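
The relationship between a base quantization factor and the per-location values Q/S_k can be shown with a minimal Python sketch. Each location is assumed to be a block holding several per-pixel sensitivity values that are summarized by either their average or their maximum; the data layout and the function names are assumptions of the sketch.

```python
def block_sensitivity(pixel_sensitivities, mode="max"):
    """Summarize the sensitivity values of one location as a single S_k."""
    if not pixel_sensitivities:
        raise ValueError("a location needs at least one sensitivity value")
    if mode == "max":
        return max(pixel_sensitivities)
    if mode == "average":
        return sum(pixel_sensitivities) / len(pixel_sensitivities)
    raise ValueError("mode must be 'max' or 'average'")

def quantization_values(base_q, block_sensitivities):
    """Return Q / S_k for every location; sensitivities must be positive."""
    return [base_q / s for s in block_sensitivities]
```

Taking the maximum is the more conservative choice, since a block is quantized as finely as its most sensitive pixel requires, while the average trades some quality at block edges for fewer bits.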


58. The method of claim 52 wherein said encoding
said frame results in the total number of bits produced for
said frame being substantially equal to a preselected
number.


59. The method of claim 58 wherein said frame is 
encoded only once.

60. The method of claim 58 wherein said encoding
of each of said plurality of locations is based on a
respective quantization value representative of a base
quantization factor divided by said sensitivity information
for a respective one of said plurality of locations.

61. The method of claim 60 wherein said base
quantization factor is derived in accordance with the
following:




where A is representative of the number of pixels in one of 
said plurality of locations, K and C are constants 
associated with said plurality of locations, N is 
representative of the number of said plurality of locations, 
B is representative of said total number of bits, the σ_i²
values are a measure of how much texture is associated with
said plurality of locations, and the S_i² values are
representative of the respective said sensitivity 
information squared. 



METHOD FOR ADAPTING QUANTIZATION IN VIDEO CODING USING 
FACE DETECTION AND VISUAL ECCENTRICITY WEIGHTING 



ABSTRACT

A system encodes video by detecting the location
of a facial region of a frame of the video. Sensitivity
information is calculated for each of a plurality of
locations within the video based upon the location of the
facial region. The frame is encoded in a manner that provides
a substantially uniform apparent quality of the plurality of
locations to the viewer when the viewer is observing the
facial region of the video.

We hereby declare that all statements made herein of my own
knowledge are true and that all statements made on information 
and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false 
statements and the like so made are punishable by fine or 
imprisonment, or both, under Section 1001 of Title 18 of the 
United States Code and that such willful false statements may 
jeopardize the validity of the application or any patent 
issued thereon. 



Dated:
Full name of sole inventor: Scott J. Daly
Residence: Kalama, Washington
Citizenship: United States of America
Post Office Address: 280 Simmons Spur Road, Kalama, WA 98625



Dated: [illegible]
Full name of first joint inventor: Kristine E. Matthews
Residence: Vancouver, Washington
Citizenship: United States of America
Post Office Address: 501 SE lMSrd Avenue, U-147, Vancouver, WA 98683




Dated: [illegible]
Full name of first joint inventor: Jordi Ribas-Corbera
Residence: Vancouver, Washington
Citizenship: United States of America
Post Office Address: 16604 SE Fisher Drive, Vancouver, WA 98683



